Linear programming based computational technique for leukemia classification using gene expression profile

Cancer is a serious public health concern worldwide and a leading cause of death. Blood cancer is one of the most dangerous types of cancer. Leukemia is a type of cancer that affects the blood cells and bone marrow. Acute leukemia progresses rapidly and is fatal if left untreated. A timely, reliable, and accurate diagnosis of leukemia at an early stage is critical to treating patients and preserving their lives. There are four types of leukemia, namely acute lymphocytic leukemia, acute myelogenous leukemia, chronic lymphocytic leukemia, and chronic myelogenous leukemia. Recognizing these cancerous cells is often done via manual analysis of microscopic images, which requires an extraordinarily skilled pathologist. Leukemia symptoms might include lethargy, a lack of energy, a pale complexion, recurrent infections, and easy bleeding or bruising. One of the challenges in this area is identifying subtypes of leukemia for specialized treatment. This study is carried out to increase the precision of diagnosis, to assist in the development of personalized treatment plans, and to improve general leukemia-related healthcare practices. In this research, we used leukemia gene expression data from the Curated Microarray Database (CuMiDa). Microarrays are ideal for studying cancer; however, categorizing the expression patterns of microarray data can be challenging. The proposed study uses feature selection methods and machine learning techniques to predict and classify subtypes of leukemia in the CuMiDa gene expression dataset GSE9476. This work uses linear programming (LP) as a machine learning technique for classification. The LP model classifies and predicts the classes Bone_Marrow_CD34, Bone_Marrow, AML, PB, and PBSC_CD34. Before applying the LP model, we selected 25 features from the 22283 features in the dataset. These 25 features were the most distinguishing for classification. The classification accuracy of this work is 98.44%.


Introduction
Blood is the most important component of the human body; about 55% of it is a liquid termed plasma that flows freely through the blood vessels. Plasma primarily serves to transport nutrients [...] numbers of characteristics, is one of the biggest problems. The generalization performance of a classifier suffers from having too many characteristics, some of which may be unimportant to the analysis. Choosing selective genes is therefore crucial to enhancing the precision and speed of prediction systems [7]. Choosing an appropriate feature set is critical for developing effective and efficient models, improving comprehensibility, minimising overfitting, and reducing complexity.
This research is based on linear programming classification of leukemia subtypes in support of cancer diagnosis. Linear programming (LP) approaches may be able to identify expression patterns quickly and precisely. Linear programming has typically been applied to microarray data with two different but complementary goals: sample categorization and gene selection. This study uses feature selection and linear programming techniques to classify types of leukemia based on leukemia gene expression data. The dataset used to evaluate gene expression levels has 22283 genes (columns) from 64 samples (rows); because this number of features is too large, a feature selection strategy was used to improve the effectiveness of the prediction technique. As a result, reducing the data quantity contributed to improving classification performance.
Our study classifies the different subtypes of leukemia represented in the chosen leukemia gene expression dataset, GSE9476 from CuMiDa, through feature selection methods and machine learning techniques.
The remainder of the paper is organised as follows: The literature review is discussed in section II, the proposed methodology with its steps is discussed in section III, the results are discussed in section IV, and the conclusion and future work is discussed in section V.

Literature review
Many researchers have identified and predicted cancer subtypes using a variety of methods. This section discusses the most notable research that applied machine learning to gene expression datasets, including linear-programming-based research on leukemia subtypes. Y. Tang et al. [4] developed an FCM-SVM-RFE algorithm, based on Recursive Feature Elimination (RFE), for predicting AML/ALL from gene expression data, which achieved an average accuracy of 92.94%. The Fuzzy C-Means (FCM) clustering approach was used to group related genes into clusters, and a Support Vector Machine (SVM) was then modeled in each cluster-induced space. This method was more accurate for predicting unknown cancer samples [2]. Yoo et al. [8] suggested a gene selection and multivariate fuzzy statistical analysis technique for evaluating microarray data from leukemia patients. It was used to analyze gene expression patterns and to investigate the leukemia subtypes whose expression patterns were linked to cases of acute leukemia. They used PCA to evaluate ALL and AML patterns. The technique also eliminates the drawbacks of threshold-based gene selection, such as the inability to select an unknown subclass. Taskesen et al. [9] combined gene expression profiles (GEP) and DNA methylation profiles (DMP). Gene expression profiles, as well as the gene patterns obtained from them, can be used to predict AML subtypes. Similarly, DNA methylation profiles were used to make successful predictions. The two profile types show different patterns that aid in the classification of AML subtypes. They employed a logistic regression model with Lasso regularization to predict AML subtypes. He et al. [10] worked on classification methods for leukemia. To efficiently extract high-level data abstractions and transform the quantitative data into fuzzy discrete transactions, the authors combined data clustering approaches with fuzzy interval partitioning on the given features. 
These transactions were supplied to the Apriori algorithm to mine association rules that supported better classification and decision-making. Experiments revealed that the FARM-DS mining technique for Fuzzy Association Rules (FARs) has good interpretability, since it extracts considerably shorter rules, and high prediction accuracy. Klein et al. [11] presented a novel approach for the systematic and rigorous comparison of published gene expression identifiers to a given demonstrative dataset. Identifying related analyses and gene mutations enhanced the analysis of new microarray data. This technique enables researchers to integrate learnings from multiple microarray experiments into the structured analysis of a new dataset. Stiglic et al. [12] introduced a new method for interpreting small ensembles of classifiers built on gene expression data, called Visual Interpretation of Small Ensembles (VISE). They showed that interactive interpretation tools, which were created for traditional machine learning challenges, also provide a wide variety of opportunities for researchers in bioinformatics and serve as an interactive tool for experts in the classification process. Feltes et al. presented the Curated Microarray Database (CuMiDa), a resource that contains 78 cancer microarray datasets that have been rigorously cross-checked from 30,000 Gene Expression Omnibus (GEO) articles. CuMiDa is a database of datasets dedicated to the testing and benchmarking of machine learning algorithms in cancer research. To observe how the samples divide, Feltes et al. analyzed all datasets using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), as well as various machine learning (ML) approaches including SVM and RF, to provide a base accuracy of 88-85% for the major techniques used on microarray datasets [13].
Bilen et al. developed a new method for rapidly classifying leukemia microarrays and decreasing data size by focusing on the most important genes. They employed two methods, an ensemble method and a hybrid method. First, an ensemble gene selection algorithm is built by filtering genes using the Wilcoxon rank-sum test, the Fisher correlation score, and information gain. Second, in the feature selection phase, an upgraded genetic algorithm identifies the most successful genes among these. Cross-validation accuracies after classification were 100% (LOOCV), 98.57% (5-fold), and 97.14% (10-fold) [14]. To categorize microarray data with a small sample size and a large number of features, Xu et al. used two modified linear discriminant analysis (LDA) techniques. They explained the sub-optimal performance of classical LDA on microarray data in terms of the uncertainty and uniqueness of the within-group covariance matrix. MLDA and NLDA were used in their study, which found that the modified LDA techniques classify data with many features and few samples better than k-nearest neighbor, diagonal linear discriminant analysis, and classical LDA [15]. When working with high-dimensional data that has a small amount of labeled data and a large amount of unlabeled data, it is difficult to obtain good classification results. A semi-supervised sparse Fisher's LDA was proposed by Lu and Qiao. LDA is reformulated and sparsity is attained using a direct estimation technique. To deal with the non-convex loss function related to the unlabeled data, they additionally employ the difference-of-convex approach. Overall, the suggested strategy improves the LDA method's capability [16]. 
Feltes et al. manually curated the Gene Expression Omnibus (GEO) using extensive filtering parameters to select homogeneous, high-quality RNA-seq and microarray datasets covering several cancer types. TCGA data was used to study frequently unregulated genetic mechanisms behind the tumoral process using machine learning techniques and biological processes. Their findings showed that tumors are more closely linked to the overexpression of essential unregulated machinery than to the under-expression of a specific gene [13]. Zhou et al. used neural networks, Bayesian statistics, and a self-organizing map (SOM) in their research. In microarray datasets, neural networks are well suited to feature learning and computation, but the sample sizes are insufficient relative to the large number of features. Because dealing with such high-dimensional, small-sample-size data is difficult, a combination of BNN and SOM can perform well, particularly in classification problems involving gene-expression-related disorders. The self-organizing map is used for dimension reduction, whereas Bayesian statistics are used to estimate feature ambiguity from the posterior distribution [17]. Grisci et al. worked on a novel strategy that uses neuroevolution as a machine learning method to classify microarray data and select the most relevant genes at the same time. The authors used the FS-NEAT algorithm. In addition, quality microarray datasets were selected using a strict filtering and preprocessing approach. When evaluated on microarray datasets of three different forms of cancer with varying numbers of samples, features, and classes, the approach reduced the number of dimensions in all datasets by over 99.9%. Using the features chosen by this method improved the performance of other algorithms [18]. Liu et al. used basic particle swarm optimization (PSO) to identify acute leukemia samples with 96.43% accuracy. 
It was compared to K-means clustering, and the findings showed that PSO performs better than K-means, although K-means is more stable [19]. Karim et al. introduced a deep-learning-based gene expression data classification method that used the Grey Wolf Optimizer (GWO) to train Sparse Auto-Encoders via an unsupervised training process. Auto-Encoders (AE) have a unique property that allows them to extract high-level attributes from raw data, and the method achieved 98.99% accuracy. Under the same test conditions and on the same datasets, the GWO results were compared with those of other metaheuristic algorithms such as Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), and Genetic Algorithms (GA). Sparse Auto-Encoders trained with GWO outperformed both conventional approaches and the abovementioned techniques [20]. Sun et al. [21] proposed a model that uses gene expression patterns for the categorization of cancer and other gene-related diseases. This field of clinical diagnosis is becoming increasingly important for accurate cancer diagnosis and the identification of cancer subtypes. To increase the accuracy of microarray data outputs, the authors suggested a gene selection strategy based on Fisher's Linear Discriminant (FLD) and the neighborhood rough set (NRS). Fisher's Linear Discriminant was useful for reducing the data so that a candidate gene subset with strong classification capability could be obtained. After that, Sun et al. defined neighborhood roughness and precision in a neighborhood decision system. Experiments showed that the proposed strategy can pick a smaller, well-classified gene subset and improve classification results. W. Tang et al. [22] proposed a novel compressive sensing (CS)-based technique for leukemia subtyping to classify ALL and AML. 
The CS method, a new technique for computational and statistical signal analysis, allows signals to be recovered from a small number of incoherent projections. To determine the class, the leave-one-out (LOO) method was used. The technique uses fewer computations and resources and achieves 97% classification accuracy.
Silva et al. compared three distinct machine learning models and data mining techniques for diagnosing acute myeloid leukemia and acute lymphoblastic leukemia from gene expression data. The first model was a support vector machine, the second was an artificial neural network, and the third was a machine learning ensemble that combines several algorithms (Artificial Neural Network, Support Vector Machine, Random Forest, Gradient Boosting, and k-NN). The learning ability and classification potential of the ensemble model were consistent, and it achieved an accuracy above 94% in classifying AML and ALL [23]. CuMiDa is a valuable resource in cancer biomedical research for benchmarking machine learning techniques in leukemia gene expression analysis [24,25]. The existing literature is summarized in Table 1.

Proposed methodology
The proposed methodology for the classification of leukemia consists of the following steps: data acquisition, data preparation, feature selection, and design of the classifier. A detailed flow diagram of the proposed model is given in Fig 1.

Data acquisition
The Leukemia gene expression dataset from CuMiDa [27] was used in this methodology for the classification of leukemia. The details of the selected dataset are given in Table 2.
The dataset contains a total of 64 samples in 5 classes: 8 cases of Bone_Marrow_CD34, 10 cases of Bone_Marrow, 26 cases of AML, 10 cases of PB, and 10 cases of PBSC_CD34.

Data preparation
After acquiring the dataset, the next step was to understand its features and their types. The chosen dataset, GSE9476, is numeric. We first displayed the dataset to understand the behavior of the features and to anticipate anomalies. All tasks and operations on the dataset were performed in the MATLAB R2021a environment. We displayed the whole data for each class to identify any outliers or abnormalities. Visualization also greatly aids in understanding and interpreting the data, allowing us to choose suitable machine learning techniques and algorithms for building computationally robust models. The data is uniform and has few outliers. The leukemia data is split into two parts, training data and testing data.
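The split into training and testing data can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB environment used in the study. It assumes the 60%/40% proportion reported in the Results and assumes that the split is made separately within each class; the function name `stratified_split`, the rounding rule, and the random seed are hypothetical choices made for the example, not details taken from the study.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Split sample indices into train/test lists, class by class (assumed scheme)."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Rounding to the nearest integer is an illustrative choice.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Class sizes as reported for the dataset (64 samples in total).
class_sizes = {
    "Bone_Marrow_CD34": 8,
    "Bone_Marrow": 10,
    "AML": 26,
    "PB": 10,
    "PBSC_CD34": 10,
}
labels = [name for name, size in class_sizes.items() for _ in range(size)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), "training samples,", len(test_idx), "testing samples")
```

Splitting within each class keeps every class represented in both parts, which matters here because the smallest class has only 8 samples.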

Feature selection
Microarrays hold enormous potential for accelerating the discovery of new biological information because they can concurrently measure the expression levels of thousands of genes. One characteristic of microarray data is that there are many more variables P (genes) than samples N. In our dataset, there are N = 64 samples and P = 22283 genes in total. To limit the size of the feature set, we must therefore find an appropriate gene selection strategy for our microarray dataset. A total of 25 features were obtained from the leukemia gene expression dataset. These are the most distinguishing features and are useful for classification. Because biomedical problems related to genes are complex, it is difficult to build a perfect model; the ideal case gives close to 100% classification accuracy. Different feature extraction techniques have been recommended in the literature to improve classification rates and lower processing costs when identifying important genes. We therefore improved our results by extracting the features that give better data separability. Reducing the number of features helped improve prediction performance in terms of speed and accuracy. The process of choosing the most suitable subset of features is known as feature extraction; we need to select the subset of features that contributes most to the best classification, so features should be prioritized according to their importance in the classification problem. Feature selection algorithm. The following algorithm was used for feature extraction in this study.
Let $y \in \mathbb{R}^{64}$ be a feature vector of the data matrix and $z \in \mathbb{R}^{64}$ be the same feature vector in the transformed domain, where $\alpha$ and $\beta$ are constants and $\alpha > \beta$. The feature vector $z$ is a normalized form of $y$. We establish maps to segregate $z$ according to the classes, where $z_1 \in \mathbb{R}^{8}$, $z_2 \in \mathbb{R}^{10}$, $z_3 \in \mathbb{R}^{26}$, $z_4 \in \mathbb{R}^{10}$, and $z_5 \in \mathbb{R}^{10}$. The vectors $z_1$, $z_2$, $z_3$, $z_4$, and $z_5$ are the parts of $z$ split according to the class labels. Let $\bar{z}_i$ be the mean of the vector $z_i$, defined for $i = 1, 2, 3, 4, 5$, where $N_1$, $N_2$, $N_3$, $N_4$, and $N_5$ are 8, 10, 26, 10, and 10, respectively. Now, we concatenate through a non-linear map $g$. Let $d$ be a measure used for the selection of features, computed for $i = 1, 2, \dots, 22283$, where $x_j$ is the value in the zero vector. We do this for every feature.
Training on test data. Consider classes 1, 2, 3, 4, and 5 that have $n_1$, $n_2$, $n_3$, $n_4$, and $n_5$ samples, respectively, where $i \in C$ and $\bar{D}_i \in \mathbb{R}^{k \times n_i}$. Consider two datasets $\bar{D}_i$ and $\bar{D}_j$ such that $i \neq j$ and $i, j \in C$. We have to perform two tasks. First, we have to determine whether $\bar{D}_i$ and $\bar{D}_j$ are linearly separable. Second, if they are separable, we find the separating hyperplane $P_{ij}$.
For linear separability: Classes $i$ and $j$ are linearly separable if there exist $x \in \mathbb{R}^{k}$ and $b \in \mathbb{R}$ that satisfy the separation inequalities. We introduce variables $y_i \geq 0$ ($i = 1, 2, \dots, n_i$) and $y_j \geq 0$ ($j = 1, 2, \dots, n_j$) and write the resulting equations in matrix form, where the vector $\mathbf{1}_a$ contains the entry 1 repeated $a$ times. From these equations we introduce constraints in the standard form of an LP and define the objective function.
Planes testing. Let $P_{ij}$ be a plane between classes $i$ and $j$ ($i \neq j$ and $j \in C$). The list of planes is given in Table 3. Let $Q^{*}_{ij}$ be an optimal threshold that gives the maximum number of correct binary classification decisions for plane $P_{ij}$ over the entire data $y_{ij}$. These thresholds are considered as the bias for the plane $P_{ij}$.
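The separability test between two classes can be illustrated with a small linear program. The sketch below is not the authors' exact model. It assumes a common slack-variable formulation: minimise the sum of the slacks subject to $\bar{D}_i^{\top} x - b + y_i \geq 1$ for class $i$, $-(\bar{D}_j^{\top} x - b) + y_j \geq 1$ for class $j$, and all slacks nonnegative. Under this assumption the two classes are linearly separable exactly when the optimal objective is zero. The function name `separating_plane`, the margin of 1, the tolerance, and the toy data are hypothetical. The code uses Python with SciPy's `linprog` for illustration rather than the MATLAB environment used in the study.

```python
import numpy as np
from scipy.optimize import linprog


def separating_plane(D_i, D_j, tol=1e-8):
    """Test linear separability of two classes with an LP (assumed formulation).

    D_i has shape (k, n_i) and D_j has shape (k, n_j); columns are samples.
    Returns (separable, x, b), where the plane is x . s = b.
    """
    k, n_i = D_i.shape
    n_j = D_j.shape[1]
    n_slack = n_i + n_j

    # Variable order: [x (k entries), b (1 entry), slacks (n_i + n_j entries)].
    cost = np.concatenate([np.zeros(k + 1), np.ones(n_slack)])

    # Class i:  D_i^T x - b + y >= 1   rewritten as  -D_i^T x + b - y <= -1
    # Class j: -D_j^T x + b + y >= 1   rewritten as   D_j^T x - b - y <= -1
    rows_i = np.hstack([-D_i.T, np.ones((n_i, 1))])
    rows_j = np.hstack([D_j.T, -np.ones((n_j, 1))])
    A_ub = np.hstack([np.vstack([rows_i, rows_j]), -np.eye(n_slack)])
    b_ub = -np.ones(n_slack)

    # x and b are free; slacks are nonnegative.
    bounds = [(None, None)] * (k + 1) + [(0, None)] * n_slack
    result = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

    separable = result.status == 0 and result.fun <= tol
    return separable, result.x[:k], result.x[k]


# Toy data with k = 2 features (hypothetical values, not taken from GSE9476).
D_i = np.array([[2.0, 3.0, 2.5], [2.0, 2.5, 3.0]])
D_j = np.array([[0.0, 0.5, -1.0], [0.0, -0.5, 0.5]])
separable, x, b = separating_plane(D_i, D_j)

# An XOR-like layout that no single plane can separate.
X_i = np.array([[0.0, 1.0], [0.0, 1.0]])
X_j = np.array([[0.0, 1.0], [1.0, 0.0]])
xor_separable, _, _ = separating_plane(X_i, X_j)
print(separable, xor_separable)
```

When the classes are separable, the returned `x` and `b` describe one valid separating plane. A threshold such as $Q^{*}_{ij}$ could then be tuned separately along the direction `x`.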

Results
In this section, we present our findings on leukemia subtype classification using microarray data. Our method is based on linear programming. The best of the observed features are those that are most clearly distinct between classes; these features are the most useful for classifying and diagnosing the various subtypes of leukemia. From the dataset of 22283 features, we chose the features that satisfied the following two goals: 1. They make the data easier to separate. 2. They are the most useful for classifying new data. Our dataset includes 64 leukemia samples overall, comprising 8 cases of Bone Marrow CD34, 10 cases of Bone Marrow, 26 cases of AML, 10 cases of PB, and 10 cases of PBSC CD34. Each sample has 22283 features, which hold the values of the gene expression profiles (GEP). Each of these features aids in placing a sample into a certain class. Our dataset includes five distinct classes or subtypes of leukemia. The proposed model uses these five categories (Bone Marrow CD34, Bone Marrow, AML, PB, and PBSC CD34) to identify the class to which each sample belongs. The 22283 features in our dataset contribute significantly to the curse of dimensionality, which is caused by small sample sizes relative to the large number of features. Computing on such data costs considerable time, memory, and effort. Details of the selected features are given in Table 4. Feature Number is the serial number of the feature in the whole dataset of 22283 features. Probe Set ID is the label of each feature in the dataset; it is also a unique identifier allotted to a specific and relevant group of genes. The results for the 25 selected features are given in Table 5. 
The samples in each class are divided into training and testing samples, which make up 60% and 40% of the data, respectively. Pairwise classification on the training and testing samples yields 100% and 98.44% accuracy, respectively. Pairwise precision for the testing samples is given in Table 6. Table 7 describes pairwise classification. Class 1 is first evaluated against classes 2, 3, 4, and 5. Then class 2 is tested against classes 3, 4, and 5. Then class 3 is tested against classes 4 and 5. Finally, class 4 is tested against class 5. Precision is calculated at each stage of this classification. In the merged setting, classes 1 and 2 are combined and first evaluated against classes 3, 4, and 5. Then class 3 is tested against classes 4 and 5, and class 4 is compared with class 5. Table 7 depicts the gene expression levels of 22283 genes from 64 samples (rows). Table 8 lists the pairwise classification plane values. We analyzed the performance of pairwise classification, which was originally developed to reduce multi-class problems to two-class problems. Pairwise classification is also advantageous for computationally expensive learning approaches. Rather than attempting to order all n items at once, making pairwise comparisons between individual items and then adding up the wins for each item makes it simpler to discern the order among the n items. Table 9 shows the pairwise classification plane values obtained by combining Class 1 and Class 2. These plane values are used to classify the testing samples after class merging. As previously explained, pairwise classification yields binary classification, which reduces a multi-class problem to a two-class problem. Table 10 shows binary classification with planes and a threshold. The leukemia classes are represented by the numbers 1, 2, 3, 4, and 5, and Ҩ denotes the threshold. 
P No stands for Plane Number. The output classification shows the class partition. Fig 2 depicts the classification of testing samples using ten planes at the same time. It shows the pairwise classification of classes by matching them and also identifies misclassified samples. Fig 2, the first half of the whole picture, shows five sub-images. Fig 3 likewise depicts the classification of testing samples using ten planes at the same time, together with the misclassified samples. Fig 3, the second half of the whole picture, shows the remaining five sub-images. Fig 7 depicts the planes that are used for binary classification. This is done by merging two classes and comparing them with all other classes. The objective of binary classification is to divide the items of a set into two groups (each termed a class) based on a classification rule. Pairwise classification on the training samples yields 100% accuracy, and pairwise classification on the testing samples gives 98.44% accuracy. We improved our results by extracting features that separate the data well. The reduced feature count improved prediction performance in terms of accuracy as well as speed. The confusion matrix is given in Table 11. The accuracy, precision, recall, and F1 score are given in Table 12.
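The relationship between a confusion matrix and the reported accuracy, precision, recall, and F1 score can be sketched as follows. The matrix below is hypothetical and is not the content of Table 11. It only uses the reported class sizes and places a single illustrative misclassification, because 63 correct decisions out of 64 samples give 63/64 ≈ 98.44%, which is consistent with the reported accuracy. The function name `per_class_metrics` is likewise an assumption made for the example.

```python
def per_class_metrics(confusion):
    """Compute overall accuracy and per-class (precision, recall, F1).

    confusion[t][p] is the number of samples of true class t predicted as class p.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n_classes))

    metrics = []
    for c in range(n_classes):
        true_positive = confusion[c][c]
        predicted_as_c = sum(confusion[t][c] for t in range(n_classes))
        actually_c = sum(confusion[c])
        precision = true_positive / predicted_as_c if predicted_as_c else 0.0
        recall = true_positive / actually_c if actually_c else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics.append((precision, recall, f1))
    return correct / total, metrics


# Hypothetical matrix: class sizes 8, 10, 26, 10, 10 with one illustrative error
# (a class-2 sample predicted as class 1). The error position is not from the paper.
confusion = [
    [8, 0, 0, 0, 0],
    [1, 9, 0, 0, 0],
    [0, 0, 26, 0, 0],
    [0, 0, 0, 10, 0],
    [0, 0, 0, 0, 10],
]
accuracy, metrics = per_class_metrics(confusion)
print(f"accuracy = {accuracy:.4f}")
for class_number, (precision, recall, f1) in enumerate(metrics, start=1):
    print(class_number, round(precision, 4), round(recall, 4), round(f1, 4))
```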
The comparison results of the proposed model have been discussed in Table 13.
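The pairwise scheme used in this section, in which every pair of the five classes gets its own plane and the wins are added up for each class, can be sketched as follows. This is a minimal illustration under stated assumptions: the decision rule (assign class i when the weighted sum exceeds the threshold, otherwise class j), the tie-breaking rule, the function name `classify_by_voting`, and the one-feature toy planes are all hypothetical and are not the plane values of Tables 8 to 10.

```python
from itertools import combinations


def classify_by_voting(sample, planes, classes):
    """Assign a sample to the class that wins the most pairwise decisions.

    planes maps a pair (i, j) to (weights, threshold); plane (i, j) votes for
    class i when the weighted sum exceeds the threshold, otherwise for class j.
    """
    votes = {c: 0 for c in classes}
    for (i, j), (weights, threshold) in planes.items():
        score = sum(w * s for w, s in zip(weights, sample))
        votes[i if score > threshold else j] += 1
    # Ties are broken in favour of the class listed first (illustrative choice).
    return max(classes, key=lambda c: votes[c])


classes = [1, 2, 3, 4, 5]
pairs = list(combinations(classes, 2))  # five classes give ten pairwise planes

# Toy planes on a single feature: class c is imagined to sit near the value c,
# so plane (i, j) votes for class i when the sample lies below the midpoint.
planes = {(i, j): ([-1.0], -(i + j) / 2) for i, j in pairs}

for value in (0.2, 3.1, 4.9):
    print(value, "->", classify_by_voting([value], planes, classes))
```

With five classes there are ten pairs, which matches the ten planes used for the testing samples in Figs 2 and 3.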

Conclusion and future work
Targeting particular treatments for various categories of leukemia patients is one of the biggest medical problems. Improvements to classification models have made them crucial for better cancer treatment. In this work, linear programming computational models were used to classify samples from the CuMiDa leukemia gene expression data into the classes Bone Marrow CD34, Bone Marrow, AML, PB, and PBSC CD34.
To make our diagnosis computationally fast, we first rescaled the dataset's 22283 features and then applied a feature selection technique to choose the most important ones. The 25 most significant features, which have high discriminative power, were selected. This study achieved a classification accuracy of 98.44%. Linear programming models play an important role in the classification of leukemia subtypes, and our model's overall performance was outstanding. This work supports the view that when leukemia subtypes are classified accurately, cure rates can increase and unnecessary toxicities can decrease, because patients will be able to take preventative measures and doctors will be able to spot the condition earlier. In the future, the approach can be applied to data with more samples and more subtypes of leukemia, so that types not addressed in this study can also be covered. The expansion of datasets, which will give us access to more samples, also brings new challenges. With more data, more accurate and more complex classifiers can be built. Reducing the number of created classifiers while using a large number of them simultaneously, as is the case with ensembles of classifiers, is thus one of the primary goals for the future. It is challenging to classify cancerous cells consistently and precisely while avoiding overfitting because of a lack of data, digitization problems, and the curse of dimensionality.